The blood virome of 10,585 individuals from the ChinaMAP

Understanding the composition of the human blood virome is essential for safe blood transfusions and for infectious disease surveillance and control. Screening natural populations with sequencing technologies to detect known and novel viral pathogens could provide critical information for the epidemiology and prevention of viral infections, vaccine development, and virus genomic investigations 1 . In addition, numerous common cancers are associated with oncogenic viruses, including Epstein-Barr virus (EBV), hepatitis B and C viruses (HBV and HCV), and human papillomavirus (HPV) 2 . The geographical and genetic diversity of the population virome could be exploited for public health screening and prevention. Therefore, we analyzed the nonhuman sequencing reads in the 40× whole-genome sequencing (WGS) dataset of 10,585 individuals from the ChinaMAP 3 to investigate the blood virome in the Chinese population. Overall, we ChinaMAP and


Identification of viral sequences
To identify viral sequences more accurately, we constructed a database including sequences from humans, Archaea (603 sequences), Bacteria (50062 sequences), and known vectors (3137 sequences), together with a customized viral database composed of NCBI RefSeq, DDBJ and EMBL viral sequences (25644 viral sequences, downloaded on 30 November 2020). First, high-quality unmapped reads were aligned to our database using Kraken2 4 v.2.1.1, and viral reads were identified and extracted according to their taxonomy ID. Candidate reads were then searched against our customized viral database with BLAST 5 v.2.7.1 and annotated at the species level. Viral hits were counted only if they met two requirements: (1) the read aligned to only one species, and (2) the alignment had an e-value < 1e-5 and a length ≥ 80 bp. Three samples were excluded because they failed quality control. We further removed reads that mapped to the same viral genome positions at very high frequency and whose coverage distribution did not fit a Poisson distribution 6 , as such reads could result from contamination by laboratory components 7 . Viral abundance was estimated by the following equation 8
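The two hit-acceptance requirements applied to the BLAST results (not the abundance equation) can be illustrated with a minimal Python sketch. The `Hit` record layout and the function name `count_viral_hits` are hypothetical, and the sketch assumes that species uniqueness is checked after the e-value and length thresholds are applied, an ordering the text does not specify.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class Hit(NamedTuple):
    """One BLAST alignment of a read (hypothetical record layout)."""
    read_id: str
    species: str
    evalue: float
    align_len: int


def count_viral_hits(hits: Iterable[Hit],
                     max_evalue: float = 1e-5,
                     min_align_len: int = 80) -> Dict[str, int]:
    """Count reads per species, keeping only unambiguous reads."""
    species_by_read = defaultdict(set)
    for h in hits:
        # Requirement 2: e-value < 1e-5 and alignment length >= 80 bp.
        if h.evalue < max_evalue and h.align_len >= min_align_len:
            species_by_read[h.read_id].add(h.species)

    counts: Dict[str, int] = defaultdict(int)
    for species in species_by_read.values():
        # Requirement 1: the read aligns to only one species.
        if len(species) == 1:
            counts[next(iter(species))] += 1
    return dict(counts)
```

In this sketch a read with passing alignments to two different species is dropped entirely rather than assigned to its best hit.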

Subtype analysis and clusters of HBVs
HBV subtypes were obtained by annotating reads at the subtype level. Reads from HBV subtype B and subtype C were aligned to Japan AB540582.

Detection of virus integration events
After viral reads were identified with BLAST, read pairs in which only one mate matched a viral species were used to detect potential integration events between the viral genome and the human genome. We examined the BWA alignments of the unmatched mates. Two cases were considered integration events: the unmatched mate aligned perfectly to the human genome, or it aligned as a split read, with one part mapped to the human genome and the other part mapped by BLAST to the same virus genome.
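The decision rule described above can be sketched as a small classifier. This is an illustration under stated assumptions, not the authors' pipeline: the `MatePair` fields are a hypothetical summary of the BLAST and BWA results for one read pair, and the labels `"human_mate"` and `"split_mate"` are names introduced here for the two accepted cases.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MatePair:
    """Hypothetical per-pair summary of the BLAST and BWA results."""
    mate1_virus: Optional[str]  # virus species matched by mate 1, or None
    mate2_virus: Optional[str]  # virus species matched by mate 2, or None
    other_full_human: bool = False   # non-viral mate aligns fully to human
    other_split_human: bool = False  # non-viral mate has a human-mapped part
    other_split_virus: Optional[str] = None  # virus hit of its other part


def classify_pair(pair: MatePair) -> Optional[str]:
    """Return 'human_mate', 'split_mate', or None (no evidence)."""
    viral = [v for v in (pair.mate1_virus, pair.mate2_virus) if v is not None]
    # Only pairs with exactly one viral mate are candidates.
    if len(viral) != 1:
        return None
    virus = viral[0]
    if pair.other_full_human:
        return "human_mate"
    # A split read must join human sequence to the SAME virus.
    if pair.other_split_human and pair.other_split_virus == virus:
        return "split_mate"
    return None
```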

Acrosin protein structure prediction
Conservation analysis of six different acrosins (P10323, P23578, P29293, P08001, P48038 and P10626), whose sequences were obtained from UniProt, was performed using Muscle 18 and Jalview 19 . AlphaFold2 20 was used to predict the protein structures of wild-type and missense mutant (T24M) acrosin from the human acrosin amino acid sequence. The predicted structures with the highest confidence, in PDB format, were used as input for PyMOL2 (The PyMOL Molecular Graphics System, Version 2.0 Schrödinger, LLC). We then used PyMOL2 to select all residues within 3.5 Å of the ACR missense variant and analyzed changes in polar contacts. Ten repeated protein structure predictions (PDB files) were generated for each of the wild-type and T24M mutant acrosin sequences.
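The neighborhood selection step can be illustrated without PyMOL by a plain distance calculation. The sketch below is a stand-in rather than the PyMOL API: the `Atom` tuple layout and the function `residues_within` are hypothetical, the cutoff is treated as inclusive, and atoms of the target residue itself are excluded from the result.

```python
import math
from typing import Iterable, List, Set, Tuple

# (residue number, x, y, z) for one atom; a hypothetical minimal record.
Atom = Tuple[int, float, float, float]


def residues_within(atoms: Iterable[Atom], target_resi: int,
                    cutoff: float = 3.5) -> List[int]:
    """Residues with at least one atom within `cutoff` angstroms
    of any atom of the target residue."""
    atom_list = list(atoms)
    target = [a for a in atom_list if a[0] == target_resi]
    near: Set[int] = set()
    for resi, x, y, z in atom_list:
        if resi == target_resi:
            continue  # skip the variant residue itself
        for _, tx, ty, tz in target:
            if math.dist((x, y, z), (tx, ty, tz)) <= cutoff:
                near.add(resi)
                break
    return sorted(near)
```

Running this on the wild-type and mutant models with `target_resi=24` would give the two residue sets whose polar contacts are then compared.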

GWAS
To identify host genetic polymorphisms associated with viral infections, we performed a GWAS on 9845 unrelated individuals 21 , comparing infected with uninfected individuals in our study. Quality control procedures for SNPs included the following: 1) having a median depth greater than 8; 2) being within a low-complexity region (less than 7 single base repeat units); 3) having a genotyping rate ≥ 90%; and 4) presenting a Hardy-Weinberg equilibrium (HWE) P value > 1 × 10⁻⁶. EPACTS was used to detect association signals, with the top two principal components, age and gender as covariates. Significantly associated loci were defined by a P value threshold of < 5 × 10⁻⁸.
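The numeric thresholds above can be expressed as a short filter. This is a minimal sketch assuming per-SNP summary statistics are already available; the `SnpStats` fields and function names are hypothetical, and the low-complexity criterion is left out because the text does not say how it is encoded.

```python
from dataclasses import dataclass

GENOME_WIDE_P = 5e-8  # significance threshold stated in the text


@dataclass
class SnpStats:
    """Hypothetical per-SNP summary statistics."""
    median_depth: float
    genotyping_rate: float  # fraction of samples genotyped, 0 to 1
    hwe_p: float            # Hardy-Weinberg equilibrium P value


def passes_qc(s: SnpStats) -> bool:
    """Apply criteria 1, 3 and 4 (depth, genotyping rate, HWE)."""
    return (s.median_depth > 8
            and s.genotyping_rate >= 0.90
            and s.hwe_p > 1e-6)


def is_significant(p_value: float) -> bool:
    """Genome-wide significance: P < 5e-8."""
    return p_value < GENOME_WIDE_P
```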

Statistical analysis
A t-test was used to compare viral abundance between integrated and non-integrated HBV-infected samples. Differences in the abundance of the top 6 viruses across age and gender were also assessed using t-tests. Differences in abundance across geographical regions were assessed by ANOVA.
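For the geographical comparison, the text does not say which ANOVA design was used; assuming a one-way design with one group of abundance values per region, the F statistic can be computed as in the sketch below. The function name `one_way_anova_f` is hypothetical, and obtaining a P value additionally requires the F distribution (for instance via `scipy.stats.f_oneway`).

```python
from statistics import fmean
from typing import Sequence


def one_way_anova_f(groups: Sequence[Sequence[float]]) -> float:
    """F statistic for a one-way ANOVA over k groups."""
    k = len(groups)
    values = [x for g in groups for x in g]
    n = len(values)
    grand_mean = fmean(values)
    # Between-group sum of squares: group means vs. grand mean.
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: values vs. their own group mean.
    ss_within = sum((x - fmean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

A large F indicates that the variation between regions is large relative to the variation within regions.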